School of Computer Sciences, Universiti Sains Malaysia.
Semester 2, 2020/2021

CDS513 Assignment 2: Time Series Analysis and Forecasting

Lee Yong Meng (P-COM0012/20)


Overview

Introduction

  • Dataset: Background Story, Attributes
  • Loading Dataset: Training Set, Test Set
  • Combining Dataset

Section 1: ARIMA

Section 2: SARIMAX

Section 3: Machine Learning


Introduction

Dataset

Bike Sharing Demand Data: https://www.kaggle.com/c/bike-sharing-demand/data

Photo by Aaron Doucett on Unsplash

About this dataset (extracted from Kaggle):

Background Story

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able rent a bike from a one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

This dataset consists of 2 files:

  • train.csv: records of the first 19 days of each month.
  • test.csv: records of the days from 20th of each month.

Attributes

train.csv and test.csv
  • datetime: hourly date + timestamp
  • season: 1 = spring, 2 = summer, 3 = fall, 4 = winter
  • holiday: whether the day is considered a holiday
  • workingday: whether the day is neither a weekend nor holiday
  • weather:
    • 1: Clear, Few clouds, Partly cloudy, Partly cloudy
    • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    • 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
  • temp: temperature in Celsius
  • atemp: "feels like" temperature in Celsius
  • humidity: relative humidity
  • windspeed: wind speed
train.csv only
  • casual: number of non-registered user rentals initiated
  • registered: number of registered user rentals initiated
  • count: number of total rentals

Columns/Measurements under consideration: temp, humidity, windspeed
Target column: count

In [1]:
# ================================================================================ 
# Import Pandas Library
# ================================================================================ 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_context("paper")
sns.set_style("whitegrid")
# sns.set(rc={'figure.figsize':(15, 6)})
%matplotlib inline

plt.rcParams["figure.dpi"] = 200

Loading Dataset

Training Set

In [2]:
# ================================================================================ 
# Training set
# ================================================================================ 
df_train = pd.read_csv('src/train.csv')
df_train.head()
Out[2]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1
In [3]:
# Information about the training set
print("> Number of rows and columns in the training set:")
display(df_train.shape)

print("> The general statistics of each column in the training set:")
display(df_train.describe())
display(df_train.describe(include=[object]))

print("> The data type and number of non-null values in each column:\n")
display(df_train.info())
> Number of rows and columns in the training set:
(10886, 12)
> The general statistics of each column in the training set:
season holiday workingday weather temp atemp humidity windspeed casual registered count
count 10886.000000 10886.000000 10886.000000 10886.000000 10886.00000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000
mean 2.506614 0.028569 0.680875 1.418427 20.23086 23.655084 61.886460 12.799395 36.021955 155.552177 191.574132
std 1.116174 0.166599 0.466159 0.633839 7.79159 8.474601 19.245033 8.164537 49.960477 151.039033 181.144454
min 1.000000 0.000000 0.000000 1.000000 0.82000 0.760000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 2.000000 0.000000 0.000000 1.000000 13.94000 16.665000 47.000000 7.001500 4.000000 36.000000 42.000000
50% 3.000000 0.000000 1.000000 1.000000 20.50000 24.240000 62.000000 12.998000 17.000000 118.000000 145.000000
75% 4.000000 0.000000 1.000000 2.000000 26.24000 31.060000 77.000000 16.997900 49.000000 222.000000 284.000000
max 4.000000 1.000000 1.000000 4.000000 41.00000 45.455000 100.000000 56.996900 367.000000 886.000000 977.000000
datetime
count 10886
unique 10886
top 2012-05-08 21:00:00
freq 1
> The data type and number of non-null values in each column:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64  
 10  registered  10886 non-null  int64  
 11  count       10886 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB
None
In [4]:
# Check missing value
np.sum(df_train.isna())
Out[4]:
datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
casual        0
registered    0
count         0
dtype: int64

Note: There is no missing value in the training set.

Check if count = casual + registered

In [5]:
# Check data validity
np.sum(df_train['count'] != df_train['casual'] + df_train['registered'])
Out[5]:
0

Note: Verified, none of the rows have different count values from casual + registered.

Therefore, count = casual + registered.

The next thing to do is to explore the correlation between columns/measurements under consideration and count.

(Assume that count is of interest.)

In [6]:
# ================================================================================ 
# Inspect correlation between each feature to the target
# ================================================================================ 
FEATURES = ['temp', 'humidity', 'windspeed']

def display_correlation(df, features, target, ascending=False):
    print(f"> Correlation between selected features with target: \"{target}\"")
    display(df[features+[target]].corr()[target].sort_values(ascending=ascending)[1:])

    
print("Selected features are:")
_ = [print(f"* {f}") for f in FEATURES]
print()

display_correlation(df_train, FEATURES, 'count')
display_correlation(df_train, FEATURES, 'casual')
display_correlation(df_train, FEATURES, 'registered')
Selected features are:
* temp
* humidity
* windspeed

> Correlation between selected features with target: "count"
temp         0.394454
windspeed    0.101369
humidity    -0.317371
Name: count, dtype: float64
> Correlation between selected features with target: "casual"
temp         0.467097
windspeed    0.092276
humidity    -0.348187
Name: casual, dtype: float64
> Correlation between selected features with target: "registered"
temp         0.318571
windspeed    0.091052
humidity    -0.265458
Name: registered, dtype: float64

Note: In all cases, the feature temp (temperature) is highly correlated to different assigned targets.

This is only for train, therefore, it is very likely that the test set have quite different distribution, who knows?

Test Set

In [7]:
# ================================================================================ 
# Test set
# ================================================================================ 
df_test = pd.read_csv('src/test.csv')
df_test
Out[7]:
datetime season holiday workingday weather temp atemp humidity windspeed
0 2011-01-20 00:00:00 1 0 1 1 10.66 11.365 56 26.0027
1 2011-01-20 01:00:00 1 0 1 1 10.66 13.635 56 0.0000
2 2011-01-20 02:00:00 1 0 1 1 10.66 13.635 56 0.0000
3 2011-01-20 03:00:00 1 0 1 1 10.66 12.880 56 11.0014
4 2011-01-20 04:00:00 1 0 1 1 10.66 12.880 56 11.0014
... ... ... ... ... ... ... ... ... ...
6488 2012-12-31 19:00:00 1 0 1 2 10.66 12.880 60 11.0014
6489 2012-12-31 20:00:00 1 0 1 2 10.66 12.880 60 11.0014
6490 2012-12-31 21:00:00 1 0 1 1 10.66 12.880 60 11.0014
6491 2012-12-31 22:00:00 1 0 1 1 10.66 13.635 56 8.9981
6492 2012-12-31 23:00:00 1 0 1 1 10.66 13.635 65 8.9981

6493 rows × 9 columns

In [8]:
# Information of test set
print("> Number of rows and columns in the test set:")
display(df_test.shape)

print("> The general statistics of each column in the test set:")
display(df_test.describe())
display(df_test.describe(include=[object]))

print("> The data type and number of non-null values in each column:\n")
display(df_test.info())
> Number of rows and columns in the test set:
(6493, 9)
> The general statistics of each column in the test set:
season holiday workingday weather temp atemp humidity windspeed
count 6493.000000 6493.000000 6493.000000 6493.000000 6493.000000 6493.000000 6493.000000 6493.000000
mean 2.493300 0.029108 0.685815 1.436778 20.620607 24.012865 64.125212 12.631157
std 1.091258 0.168123 0.464226 0.648390 8.059583 8.782741 19.293391 8.250151
min 1.000000 0.000000 0.000000 1.000000 0.820000 0.000000 16.000000 0.000000
25% 2.000000 0.000000 0.000000 1.000000 13.940000 16.665000 49.000000 7.001500
50% 3.000000 0.000000 1.000000 1.000000 21.320000 25.000000 65.000000 11.001400
75% 3.000000 0.000000 1.000000 2.000000 27.060000 31.060000 81.000000 16.997900
max 4.000000 1.000000 1.000000 4.000000 40.180000 50.000000 100.000000 55.998600
datetime
count 6493
unique 6493
top 2011-04-24 08:00:00
freq 1
> The data type and number of non-null values in each column:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    6493 non-null   object 
 1   season      6493 non-null   int64  
 2   holiday     6493 non-null   int64  
 3   workingday  6493 non-null   int64  
 4   weather     6493 non-null   int64  
 5   temp        6493 non-null   float64
 6   atemp       6493 non-null   float64
 7   humidity    6493 non-null   int64  
 8   windspeed   6493 non-null   float64
dtypes: float64(3), int64(5), object(1)
memory usage: 456.7+ KB
None

Combining Dataset

In [9]:
df_train.head()
Out[9]:
datetime season holiday workingday weather temp atemp humidity windspeed casual registered count
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 16
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 40
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 32
3 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0.0 3 10 13
4 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0.0 0 1 1
In [10]:
# ================================================================================ 
# Generate time series data
# ================================================================================ 
FEATURE = 'temp'
DATETIME = 'datetime'

def generate_time_series_data(df, feature, datetime):
    df_new = df.copy()
    df_new[datetime] = pd.to_datetime(df_new[datetime])
    df_new = df_new[[datetime, feature]]
    df_new.set_index(datetime, drop=True, inplace=True)
    
    return df_new

generate_time_series_data(df_train, FEATURE, DATETIME).head()

print("Successful!")
Successful!
In [11]:
df_train_ts = generate_time_series_data(df_train, FEATURE, DATETIME)
df_train_ts.describe()
Out[11]:
temp
count 10886.00000
mean 20.23086
std 7.79159
min 0.82000
25% 13.94000
50% 20.50000
75% 26.24000
max 41.00000
In [12]:
df_test_ts = generate_time_series_data(df_test, FEATURE, DATETIME)
df_test_ts.describe()
Out[12]:
temp
count 6493.000000
mean 20.620607
std 8.059583
min 0.820000
25% 13.940000
50% 21.320000
75% 27.060000
max 40.180000
In [13]:
# Average Hourly Temperature
df_temp = df_train_ts.append(df_test_ts)
df_temp.sort_index(inplace=True)
df_temp.plot(figsize=(10, 4))
# df_temp.to_csv('output/hourly_temp.csv')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x29234641b50>
In [14]:
# ================================================================================ 
# Daily temperature data
# ================================================================================ 
df_temp_daily = df_temp.resample("d").mean()
ax = df_temp_daily.plot(figsize=(10, 4), title="Average Daily Temperature: Jan 2011 - Dec 2012", legend=False)
ax.set_xlabel('Date')
ax.set_ylabel('Temperature')
print()
# df_temp_daily.to_csv('output/daily_temp.csv')

By looking at the line chart, we can immediately tell that this time series data is not stationary.


Section 1: ARIMA

Step 1: Check Stationarity

Check stationarity of the daily temperature data by inspecting the mean and variance at 17 different regions.

1.1. Augmented Dickey-Fullter (ADF) Test

We use Augmented Dickey-Fuller test to check stationarity of our time series data.

In [15]:
# ================================================================================ 
# Check stationarity using ADF test
# ================================================================================ 

# Import adfuller() function from statsmodels package.
from statsmodels.tsa.stattools import adfuller

# Define helper function to check stationarity using ADF test.
def check_stationarity(timeseries):
    result = adfuller(timeseries, autolag='AIC')
    
    # Generate Pandas Series using the result of the ADF test.
    dfoutput = pd.Series(result[0:4], 
                         index=['Test Statistic',
                                'p-value', 
                                'Number of Lags Used', 
                                'Number of Observations Used'])
    display(dfoutput)
    
    print('The test statistic: %f' % result[0])
    print('p-value: %f' % result[1])

    # Print critical values for the test statistics at 1%, 5%, and 10% level. 
    print('Critical Values:')
    for key, value in result[4].items():
        print('%s: %.3f' % (key, value))

Note: Recall that:

  • Null Hypothesis, $H_0$: There is unit root in the time series data - the data is not stationary.
  • Alternative Hypothesis, $H_a$: There is no unit root in the time series data - the data is stationary.

ADF Test Using (Average) Daily Temperature Data

In [16]:
# ================================================================================ 
# Check stationarity of daily temperature data using the ADF test.
# ================================================================================ 
check_stationarity(df_temp_daily)
Test Statistic                  -1.966226
p-value                          0.301575
Number of Lags Used              6.000000
Number of Observations Used    724.000000
dtype: float64
The test statistic: -1.966226
p-value: 0.301575
Critical Values:
1%: -3.439
5%: -2.866
10%: -2.569

Note: The $p$-value is greater than 0.05, indicating that the time series data of the daily temperature is not stationary.


Step 2: Transforming the Time Series Data

2.1. Differencing the Time Series Data

Apply differencing on the daily temperature data. This is the first order differencing.

In [17]:
# ================================================================================ 
# Differencing the daily temperature data
# ================================================================================ 
first_diff = np.diff(df_temp_daily[FEATURE])
df_first_diff = pd.DataFrame(first_diff, columns=['First Difference'], index=df_temp_daily.index[1:])

plt.figure(figsize=(10, 4))
plt.title("Average Daily Temperature after First Differencing")
plt.plot(df_first_diff, label="First Difference")

plt.xlabel("Date")
plt.ylabel("Temp Difference")

plt.show()
In [18]:
df_first_diff
Out[18]:
First Difference
datetime
2011-01-02 7.917754e-01
2011-01-03 -6.851700e+00
2011-01-04 1.490909e-01
2011-01-05 1.105217e+00
2011-01-06 -9.269565e-01
... ...
2012-12-27 4.441667e-01
2012-12-28 -3.416667e-02
2012-12-29 -3.552714e-15
2012-12-30 1.025000e-01
2012-12-31 -1.640000e+00

730 rows × 1 columns

ADF Test using Daily Temperature Data after Differencing

In [19]:
print("[ADF Test]")
print("=== First Difference: Original ===")
check_stationarity(first_diff)
[ADF Test]
=== First Difference: Original ===
Test Statistic                -8.687842e+00
p-value                        4.116934e-14
Number of Lags Used            1.500000e+01
Number of Observations Used    7.140000e+02
dtype: float64
The test statistic: -8.687842
p-value: 0.000000
Critical Values:
1%: -3.440
5%: -2.866
10%: -2.569

Note: The $p$-value is less than 0.05, implying that the daily temperature after first differencing is stationary (since we have strong evidence to reject $H_0$).

Therefore, the proper integration order $q$ for the daily temperature data is 1.


Step 3: Determine $p$ and $q$ of ARIMA Model

3.1. Determine $p$ and $q$ from the PACF and ACF

We use Autocorrelation (ACF) and Partial Autocorrelation (PACF) to identify the values of $q$ and $p$ of the ARIMA model respectively, where:

  • $p$ is the lag order.
  • $q$ is the moving average order.
In [20]:
# ================================================================================ 
# PACF and ACF of the daily temperature data
# ================================================================================ 
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.api as sm

temp_diff = first_diff

# Set plot figure size
fig = plt.figure(figsize=(10,6))

# Plot ACF from 'df'
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(temp_diff, lags=60, ax=ax1)

# Plot PACF from 'df'
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(temp_diff, lags=60, ax=ax2)

It's difficult to tell which $p$ and $q$ should be used.

  • Autocorrelation (ACF): $q$ is roughly 4.
  • Partial Autocorrelation (PACF): $p$ is roughly 6.

Therefore, we use autoarima function implemented in Python's pmdarima package to search for the optimal orders for an ARIMA model to fit the daily temperature data.

3.2. Auto ARIMA

Apply Auto ARIMA on the daily temperature data (before first differencing).

In [21]:
# ================================================================================ 
# ARIMA: Determine p, d, q using Auto ARIMA
# ================================================================================ 
import pmdarima as pm
auto_arima_fit = pm.auto_arima(df_temp_daily[FEATURE], 
#                                test='adf',
                               start_p=1, start_q=1,
                               max_p=7, max_q=7, 
                               seasonal=False, # if True: SARIMA; otherwise: ARIMA
                               trace=True, # if True: print debugging information; otherwise: no print
                               error_action='ignore',
                               suppress_warnings=True,
                               stepwise=True)
Performing stepwise search to minimize aic
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=3280.550, Time=0.27 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=3350.358, Time=0.02 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=3350.763, Time=0.06 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=3348.914, Time=0.07 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=3348.365, Time=0.02 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=3215.991, Time=0.17 sec
 ARIMA(2,1,0)(0,0,0)[0] intercept   : AIC=3288.547, Time=0.10 sec
 ARIMA(3,1,1)(0,0,0)[0] intercept   : AIC=3217.986, Time=0.26 sec
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=3217.986, Time=0.28 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=3224.353, Time=0.23 sec
 ARIMA(3,1,0)(0,0,0)[0] intercept   : AIC=3260.698, Time=0.12 sec
 ARIMA(3,1,2)(0,0,0)[0] intercept   : AIC=3218.902, Time=0.35 sec
 ARIMA(2,1,1)(0,0,0)[0]             : AIC=3213.991, Time=0.07 sec
 ARIMA(1,1,1)(0,0,0)[0]             : AIC=3278.552, Time=0.06 sec
 ARIMA(2,1,0)(0,0,0)[0]             : AIC=3286.552, Time=0.04 sec
 ARIMA(3,1,1)(0,0,0)[0]             : AIC=3215.986, Time=0.12 sec
 ARIMA(2,1,2)(0,0,0)[0]             : AIC=3215.986, Time=0.10 sec
 ARIMA(1,1,0)(0,0,0)[0]             : AIC=3348.769, Time=0.03 sec
 ARIMA(1,1,2)(0,0,0)[0]             : AIC=3222.355, Time=0.09 sec
 ARIMA(3,1,0)(0,0,0)[0]             : AIC=3258.703, Time=0.05 sec
 ARIMA(3,1,2)(0,0,0)[0]             : AIC=3216.902, Time=0.17 sec

Best model:  ARIMA(2,1,1)(0,0,0)[0]          
Total fit time: 2.742 seconds

The best ARIMA model for daily temperature is ARIMA(2,1,1) - it has the lowest AIC value.

Note: Negative AIC value is smaller, thus better model.


Step 4: Fit an ARIMA Model

Define ARIMA(2,1,1) model using ARIMA implemented in the Python's statsmodels package.

In [22]:
import statsmodels

statsmodels.__version__
Out[22]:
'0.12.2'
In [23]:
# ================================================================================ 
# ARIMA: Build Model
# ================================================================================ 
import warnings
warnings.filterwarnings("ignore") # specify to ignore warning messages

from statsmodels.tsa.arima_model import ARIMA

# Generate ARIMA estimator using temperature and specified order: (p, d, q)
mod = ARIMA(df_temp_daily[FEATURE], order=(2, 1, 1), freq='D')

# results = mod.fit(disp=0)
results = mod.fit(disp=0)

# Print the test result of the ARIMA estimator/model
print(results.summary())
                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                 D.temp   No. Observations:                  730
Model:                 ARIMA(2, 1, 1)   Log Likelihood               -1602.996
Method:                       css-mle   S.D. of innovations              2.174
Date:                Sun, 13 Jun 2021   AIC                           3215.991
Time:                        18:11:32   BIC                           3238.957
Sample:                    01-02-2011   HQIC                          3224.852
                         - 12-31-2012                                         
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const          3.68e-05      0.028      0.001      0.999      -0.054       0.054
ar.L1.D.temp     0.7149      0.047     15.303      0.000       0.623       0.806
ar.L2.D.temp    -0.3217      0.038     -8.548      0.000      -0.395      -0.248
ma.L1.D.temp    -0.7919      0.038    -20.654      0.000      -0.867      -0.717
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.1112           -1.3689j            1.7631           -0.1415
AR.2            1.1112           +1.3689j            1.7631            0.1415
MA.1            1.2627           +0.0000j            1.2627            0.0000
-----------------------------------------------------------------------------

Step 5: Residual Diagnosis

A model residuals is the difference between the predicted and expected value and can be verified using the fitted model property resid.

Residual object is of type ndarray so we will store it in a DataFrame for plotting.

In the below line plot we don’t see any large residuals and all of them are within their upper and lower limits.

In [24]:
# ================================================================================ 
# ARIMA: Residual plot
# ================================================================================ 
residuals = pd.DataFrame(results.resid, columns=['Residuals'])
ax = residuals.plot(figsize=(10, 4), title="Average Daily Temperature: Residual Plot", legend=False)
ax.set_xlabel("Date")
ax.set_ylabel("Residual")
print()

Next we will check if these residuals are normally distributed and looks Gaussian or not. So we will plot the density plot to check this. This looks normal with a long left tail and centered at zero.

In [25]:
residuals.plot(kind='kde', figsize=(10, 4))
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x292389eedf0>

The mean of the residual is close to zero and there is no significant correlation also that we can see in the PACF plot for residuals.

In [26]:
# ================================================================================ 
# ARIMA: ACF and PACF of the residual
# ================================================================================ 
display(residuals.describe())
# plt.figure(figsize=(10, 4))

plt.rc("figure", figsize=(10, 4))
plot_pacf(residuals, lags=40)
plot_acf(residuals, lags=40)
print()
Residuals
count 730.000000
mean -0.009347
std 2.178097
min -7.621418
25% -1.276494
50% 0.033653
75% 1.275451
max 8.474293

The residual diagnostics looks like a white noise since 95% of our sample autocorrelations is within the blue shaded region and it meets all our criteria for a good forecast and prediction.

So let’s go ahead and evaluate the predicted and expected values.


Step 6: ARIMA Predict

Predict the daily temperatures using the ARIMA model. The actual and predicted temperatures for the first 660 days are plotted for comparison (days_pred).

Note: When the integration order $d$ of the ARIMA model is not 0, the predict() function returns the predicted $d$-order differencing values.

Therefore, upon plotting the predicted temperatures by the ARIMA model, the predicted values are added back to the values of the first differencing.

In [27]:
# ================================================================================ 
# ARIMA: Original and Predicted Plot
# ================================================================================ 
from math import sqrt
from sklearn.metrics import mean_squared_error

days_pred = 660

plt.figure(figsize=(10, 4))

# The first value from the original data.
temp_pred = [df_temp_daily[FEATURE][0]]

for diff in results.predict().values:
    temp_new = temp_pred[-1] - diff
    temp_pred.append(temp_new)
    
temp_pred_series = pd.Series(temp_pred[1:], index=results.predict().index)
# print(len(temp_pred))

# Plot temperatures: before reverse transform
plt.plot(df_temp_daily[FEATURE][:days_pred], label='Temperature: Actual', color='r')
# plt.plot(results.predict(1, days_pred) + temp_pred[1:days_pred+1], label='Temperature: Predicted', color='g') # Predict the difference
plt.plot(temp_pred_series[:days_pred], label='Temperature: Predicted', color='g') # Predict the difference

plt.title("Average Daily Temperature: Observed vs Predicted")
plt.xlabel("Date")
plt.ylabel("Temperature")
plt.legend()
plt.show()

Step 7: ARIMA Forecast

Forecast the daily temperature from 1 - 20 Jan 2013 (20 days) using ARIMA(2,1,1) model.

In [28]:
# type(results)
In [29]:
# ================================================================================ 
# ARIMA: Forecast values
# ================================================================================ 
# # Set # months (20 days)
n = 20

# Generate forecast for the next 3 years
forecast, err, ci = results.forecast(steps=n)

# Create new Dataframe using forecast value from 1 Jan 2020.
df_forecast = pd.DataFrame({'forecast': forecast}, 
                           index=pd.date_range(start='1/1/2013', periods=n, freq='D'))

print(df_forecast.head(20))

df_forecast.shape
             forecast
2013-01-01   9.173129
2013-01-02   9.932334
2013-01-03  10.370918
2013-01-04  10.440267
2013-01-05  10.348781
2013-01-06  10.261089
2013-01-07  10.227847
2013-01-08  10.232314
2013-01-09  10.246223
2013-01-10  10.254753
2013-01-11  10.256399
2013-01-12  10.254854
2013-01-13  10.253242
2013-01-14  10.252609
2013-01-15  10.252698
2013-01-16  10.252987
2013-01-17  10.253187
2013-01-18  10.253260
2013-01-19  10.253270
2013-01-20  10.253276
Out[29]:
(20, 1)
In [30]:
# ================================================================================ 
# ARIMA: Forecast Plot
# ================================================================================ 
# Plot actual data from the 1,600th record.
date_sep = '2012-09-01'
num_days_since_sep_01 = df_temp_daily[df_temp_daily.index >= date_sep].shape[0]

ax = df_temp_daily[FEATURE][-num_days_since_sep_01:].plot(label='Temperature: Actual', figsize=(10, 4))
ax = temp_pred_series[-num_days_since_sep_01:].plot(ax=ax, label='Temperature: Predicted')
df_forecast.plot(ax=ax, label='Temperature: Forecast')

# Plot confidence interval (if not NaN, when p-value > 0.05)
ax.fill_between(df_forecast.index,
                ci[:, 0], ci[:, 1], 
                color='g', alpha=.25)

ax.set_xlabel('Date')
ax.set_ylabel('Temperature')

plt.title("Average Daily Temperature: Actual vs Predicted vs Forecast")
plt.legend(['Temperature: Actual', 'Temperature: Predicted', 'Temperature: Forecast'])
plt.show()

Check different types of errors generated by the ARIMA model when predicting daily temperatures.

MAE: Mean absolute error

$$\text{MAE} = \frac{1}{N}\sum_{i=1}^N{\left|y_{true} - y_{pred}\right|}$$

MRE: Mean relative error

$$\text{MRE} = \frac{1}{N}\sum_{i=1}^N{\left|\frac{y_{true} - y_{pred}}{y_{true}}\right|}$$

MSE: Mean squared error

$$\text{MSE} = \frac{1}{N}\sum_{i=1}^N{\left(y_{true} - y_{pred}\right)^2}$$

RMSE: Root mean squared error

$$\text{RMSE} = \sqrt{\text{MSE}}$$
In [31]:
# ================================================================================ 
# ARIMA: RMSE and MSE
# ================================================================================ 

# Load specific evaluation tools
from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.tools.eval_measures import rmse

# date_sep = '2012-09-01'

# Relative error
def mean_relative_error(y_true, y_pred,):
    import numpy as np
    relative_error = np.average(np.abs(y_true - y_pred) / y_true, axis=0)
    return relative_error

# mae_calc = mean_absolute_error(df_temp_daily[FEATURE][1:], temp_pred_series)
# mre_calc = mean_relative_error(df_temp_daily[FEATURE][1:], temp_pred_series)
# mse_calc = mean_squared_error(df_temp_daily[FEATURE][1:], temp_pred_series)
# rmse_calc = rmse(df_temp_daily[FEATURE][1:], temp_pred_series)

num_days = df_temp_daily.shape[0]
num_days_since_sep_01 = df_temp_daily[df_temp_daily.index >= date_sep].shape[0]

temp_true = df_temp_daily[FEATURE][-num_days_since_sep_01:]
temp_pred = temp_pred_series[-num_days_since_sep_01:]

mae_calc = mean_absolute_error(temp_true, temp_pred)
mre_calc = mean_relative_error(temp_true, temp_pred)
mse_calc = mean_squared_error(temp_true, temp_pred)
rmse_calc = rmse(temp_true, temp_pred)

print('MAE: ', mae_calc)
print('MRE: ', mre_calc)
print('MSE: ', mse_calc)
print('RMSE:', rmse_calc)
MAE:  9.924519347432767
MRE:  0.5410133953708663
MSE:  132.13868239184018
RMSE: 11.495159085103614

Section 2: SARIMAX (Seasonal ARIMA Exogenous)

SARIMAX is a variation of ARIMA, which helps to handle the seasonality components in the time series data. In this assignment, a SARIMAX model is also built to fit the daily temperature data - which is treated under the category of ARIMA model.


Step 1: Determine Orders for SARIMAX Model using Auto ARIMA

Set m=7 for daily temperature data - implying the weekly (7-day) seasonality within the daily temperature data.

In [32]:
# ================================================================================ 
# SARIMAX: Determine orders using Auto ARIMA
# ================================================================================ 
Sarimax_model = pm.auto_arima(df_temp_daily[FEATURE],
                              start_P=1, start_q=1,
                              max_p=3, max_q=3,
                              m=7,
                              seasonal=True,
                              d=None,
                              D=1,
                              trace=True,
                              error_action='ignore',
                              suppress_warnings=True,
                              stepwise=True)
Performing stepwise search to minimize aic
 ARIMA(2,0,1)(1,1,1)[7] intercept   : AIC=3313.406, Time=0.52 sec
 ARIMA(0,0,0)(0,1,0)[7] intercept   : AIC=4033.213, Time=0.02 sec
 ARIMA(1,0,0)(1,1,0)[7] intercept   : AIC=3456.928, Time=0.17 sec
 ARIMA(0,0,1)(0,1,1)[7] intercept   : AIC=3476.326, Time=0.16 sec
 ARIMA(0,0,0)(0,1,0)[7]             : AIC=4031.214, Time=0.02 sec
 ARIMA(2,0,1)(0,1,1)[7] intercept   : AIC=3312.409, Time=0.38 sec
 ARIMA(2,0,1)(0,1,0)[7] intercept   : AIC=3582.962, Time=0.18 sec
 ARIMA(2,0,1)(0,1,2)[7] intercept   : AIC=3313.355, Time=0.72 sec
 ARIMA(2,0,1)(1,1,0)[7] intercept   : AIC=3401.793, Time=0.28 sec
 ARIMA(2,0,1)(1,1,2)[7] intercept   : AIC=3315.551, Time=0.88 sec
 ARIMA(1,0,1)(0,1,1)[7] intercept   : AIC=3315.504, Time=0.25 sec
 ARIMA(2,0,0)(0,1,1)[7] intercept   : AIC=3326.118, Time=0.23 sec
 ARIMA(3,0,1)(0,1,1)[7] intercept   : AIC=inf, Time=1.39 sec
 ARIMA(2,0,2)(0,1,1)[7] intercept   : AIC=inf, Time=1.54 sec
 ARIMA(1,0,0)(0,1,1)[7] intercept   : AIC=3335.988, Time=0.26 sec
 ARIMA(1,0,2)(0,1,1)[7] intercept   : AIC=inf, Time=1.11 sec
 ARIMA(3,0,0)(0,1,1)[7] intercept   : AIC=inf, Time=0.56 sec
 ARIMA(3,0,2)(0,1,1)[7] intercept   : AIC=3313.495, Time=1.04 sec
 ARIMA(2,0,1)(0,1,1)[7]             : AIC=3310.451, Time=0.20 sec
 ARIMA(2,0,1)(0,1,0)[7]             : AIC=3580.963, Time=0.09 sec
 ARIMA(2,0,1)(1,1,1)[7]             : AIC=3311.444, Time=0.33 sec
 ARIMA(2,0,1)(0,1,2)[7]             : AIC=3311.392, Time=0.52 sec
 ARIMA(2,0,1)(1,1,0)[7]             : AIC=3399.796, Time=0.18 sec
 ARIMA(2,0,1)(1,1,2)[7]             : AIC=3313.591, Time=0.59 sec
 ARIMA(1,0,1)(0,1,1)[7]             : AIC=3313.554, Time=0.16 sec
 ARIMA(2,0,0)(0,1,1)[7]             : AIC=3324.162, Time=0.17 sec
 ARIMA(3,0,1)(0,1,1)[7]             : AIC=inf, Time=0.67 sec
 ARIMA(2,0,2)(0,1,1)[7]             : AIC=inf, Time=0.82 sec
 ARIMA(1,0,0)(0,1,1)[7]             : AIC=3334.014, Time=0.12 sec
 ARIMA(1,0,2)(0,1,1)[7]             : AIC=inf, Time=1.02 sec
 ARIMA(3,0,0)(0,1,1)[7]             : AIC=inf, Time=0.43 sec
 ARIMA(3,0,2)(0,1,1)[7]             : AIC=3311.534, Time=0.64 sec

Best model:  ARIMA(2,0,1)(0,1,1)[7]          
Total fit time: 15.674 seconds

The best SARIMAX model for daily temperature (for m=7) is SARIMAX(2,0,1)(0,1,1,7) - again, it has the lowest AIC value.


Step 2: Fit SARIMAX Model

Define SARIMAX(2,0,1)(0,1,1,7) model using ARIMA implemented in the Python's statsmodels package.

  • Set order=(2,0,1)
  • Set seasonal_order=(0,1,1,7)
In [33]:
# ================================================================================ 
# SARIMAX: Build model
# ================================================================================ 
import statsmodels.api as sm
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(df_temp_daily[FEATURE], order=(2, 0, 1),
              seasonal_order=(0, 1, 1, 7),
              enforce_stationarity=False,
              enforce_invertibility=False)

results = model.fit()
print(results.summary())
                                     SARIMAX Results                                     
=========================================================================================
Dep. Variable:                              temp   No. Observations:                  731
Model:             SARIMAX(2, 0, 1)x(0, 1, 1, 7)   Log Likelihood               -1628.022
Date:                           Sun, 13 Jun 2021   AIC                           3266.043
Time:                                   18:11:50   BIC                           3288.905
Sample:                               01-01-2011   HQIC                          3274.872
                                    - 12-31-2012                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.4472      0.124      3.599      0.000       0.204       0.691
ar.L2          0.2470      0.113      2.187      0.029       0.026       0.468
ma.L1          0.5342      0.117      4.577      0.000       0.305       0.763
ma.S.L7       -0.7319      0.027    -26.868      0.000      -0.785      -0.679
sigma2         5.5436      0.260     21.295      0.000       5.033       6.054
===================================================================================
Ljung-Box (L1) (Q):                   0.13   Jarque-Bera (JB):                11.73
Prob(Q):                              0.71   Prob(JB):                         0.00
Heteroskedasticity (H):               0.64   Skew:                            -0.05
Prob(H) (two-sided):                  0.00   Kurtosis:                         3.62
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Step 3: Residual Diagnosis (SARIMAX)

Using the same steps in ARIMA Step 5, the residual, ACF and PACF of the SARIMAX model are plotted.

In [34]:
# ================================================================================ 
# SARIMAX: Residual plot
# ================================================================================ 
residuals = pd.DataFrame(results.resid, columns=['Residual'])
ax = residuals.plot(figsize=(10, 4), title="Average Daily Temperature: Residual Plot", legend=False)
ax.set_xlabel('Date')
ax.set_ylabel('Residual')

residuals.plot(kind='kde')

plt.rc("figure", figsize=(10, 4))

plot_pacf(residuals, lags=40)
plot_acf(residuals, lags=40)

print()

The residual diagnostics looks like a white noise since 95% of our sample autocorrelations is within (or very close) the blue shaded region and it meets all our criteria for a good forecast and prediction.


Step 4: SARIMAX Predict

Predict the daily temperatures using the SARIMAX model. The actual and predicted temperatures for the first 660 days are plotted for comparison (days_pred).

In [35]:
# ================================================================================ 
# SARIMAX: Original and Predicted Plot
# ================================================================================ 
days_pred = 660

plt.figure(figsize=(10, 4))

plt.plot(df_temp_daily[FEATURE][:days_pred], label='Temperature: Actual', color='r')
plt.plot(results.predict(0, days_pred), label='Temperature: Predicted', color='g' )
plt.title("Average Daily Temperature: Actual vs Predicted")
plt.xlabel("Date")
plt.ylabel("Temperature")

plt.legend()
plt.show()

Step 5: SARIMAX Forecast

Forecast the daily temperature from 1 - 24 Jan 2013 (24 days) using the SARIMAX model.

In [36]:
# ================================================================================ 
# SARIMAX: Forecast values
# ================================================================================ 
forecast = results.predict(start=len(df_temp_daily[FEATURE]),
                           end=len(df_temp_daily[FEATURE])+24,
                           typ='levels').rename('data sarimax (1,0,1) forecast')
In [37]:
# ================================================================================ 
# SARIMAX: Forecast plot
# ================================================================================
date_sep = '2012-09-01'
num_days_since_sep_01 = df_temp_daily[df_temp_daily.index >= date_sep].shape[0]
num_days = df_temp_daily.shape[0]

ax = df_temp_daily[FEATURE][-num_days_since_sep_01:].plot(figsize=(10, 4), legend=True, label='Temperature: Actual')
ax = results.predict(num_days-num_days_since_sep_01, num_days).plot(ax=ax, legend=True, label='Temperature: Predicted')
ax = forecast.plot(ax=ax, legend=True, label='Temperature: Forecast')
ax.set_title("Average Daily Temperature: Actual vs Predicted vs Forecast")
ax.set_xlabel('Date')
ax.set_ylabel('Temperature')

print()

Check different types of errors generated by the ARIMA model when predicting daily temperatures.

In [38]:
# ================================================================================ 
# SARIMAX: Error
# ================================================================================ 
# Load specific evaluation tools
from sklearn.metrics import mean_squared_error
from statsmodels.tools.eval_measures import rmse

num_days = df_temp_daily.shape[0]
num_days_since_sep_01 = df_temp_daily[df_temp_daily.index >= date_sep].shape[0]

temp_true = df_temp_daily[FEATURE][-num_days_since_sep_01:]
temp_pred = results.predict(num_days-num_days_since_sep_01, num_days-1)

mae_calc = mean_absolute_error(temp_true, temp_pred)
mre_calc = mean_relative_error(temp_true, temp_pred)
mse_calc = mean_squared_error(temp_true, temp_pred)
rmse_calc = rmse(temp_true, temp_pred)

print('MAE: ', mae_calc)
print('MRE: ', mre_calc)
print('MSE: ', mse_calc)
print('RMSE:', rmse_calc)
MAE:  1.6161546193577057
MRE:  0.10380703069486576
MSE:  4.420535814962723
RMSE: 2.102507030894956

Section 3: Machine Learning

In this section, different machine learning models are implemented and their performances on time series prediction tasks are evaluated using different error metrics.

Finally, the best model is applied to perform the forecasting task on the daily temperature data.

In [39]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime
import seaborn as sns
from numpy import asarray
from pandas import concat
from pandas import DataFrame

# sns.set(rc={'figure.figsize':(10, 6)})
sns.set_context("paper")
sns.set_style("whitegrid")
# sns.set(rc={'figure.figsize':(15, 6)})
%matplotlib inline

plt.rcParams["figure.dpi"] = 200
In [40]:
df = pd.read_csv('output/daily_temp.csv', index_col='datetime')
df.index = pd.to_datetime(df.index)
df.head()
Out[40]:
temp
datetime
2011-01-01 14.110833
2011-01-02 14.902609
2011-01-03 8.050909
2011-01-04 8.200000
2011-01-05 9.305217

Refresh: Define temp as the target column, then plot the daily temperature data (i.e., time series data).

In [41]:
FEATURE = 'temp'

df[FEATURE].plot(figsize=(10, 4))
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x292681cb700>

Step 1: Data Preparation

Transform the time series prediction and forecasting tasks into regression tasks by defining new features using rolling window technique.

1.1: Rolling Window

Generate the 7-day average temperature values using the daily temperature readings from the past 7 days. Define new column temp_7_day_mean to store these values.

In [42]:
#7-day rolling mean
df.rolling(7).mean()[0:21]
Out[42]:
temp
datetime
2011-01-01 NaN
2011-01-02 NaN
2011-01-03 NaN
2011-01-04 NaN
2011-01-05 NaN
2011-01-06 NaN
2011-01-07 10.143603
2011-01-08 9.094198
2011-01-09 7.775492
2011-01-10 7.508815
2011-01-11 7.327776
2011-01-12 7.010147
2011-01-13 6.779681
2011-01-14 6.570862
2011-01-15 6.971100
2011-01-16 7.517766
2011-01-17 7.664195
2011-01-18 7.942853
2011-01-19 8.642469
2011-01-20 9.208659
2011-01-21 9.306066
In [43]:
df['temp_7_day_mean'] = df[FEATURE].rolling(window=7).mean()

# df_7day = pd.DataFrame(df[[FEATURE, 'temp_7_day_mean']], index=df.index)

ax = df[[FEATURE, 'temp_7_day_mean']].plot(figsize=(10, 4))
# ax = df_7day.plot(figsize=(10, 4))
ax.set_title("Average Daily Temperature: Actual vs 7-Day Rolling Windows")
ax.set_xlabel("Date")
ax.set_ylabel("Temperature")
ax.legend(['Temperature: Actual', 'Temperature: 7-Day Rolling Windows'])

print()

Generate the 30-day average temperature values using the daily temperature readings from the past 30 days. Define new column temp_30_day_mean to store these values.

In [63]:
df['temp_30_day_mean'] = df[FEATURE].rolling(window=30).mean()
ax = df[[FEATURE, 'temp_30_day_mean']].plot(figsize=(10, 4))

ax.set_title("Average Daily Temperature: Actual vs 30-Day Rolling Windows")
ax.set_xlabel("Date")
ax.set_ylabel("Temperature")
ax.legend(['Temperature: Actual', 'Temperature: 30-Day Rolling Windows'])

print()

In [45]:
df
Out[45]:
temp temp_7_day_mean temp_30_day_mean
datetime
2011-01-01 14.110833 NaN NaN
2011-01-02 14.902609 NaN NaN
2011-01-03 8.050909 NaN NaN
2011-01-04 8.200000 NaN NaN
2011-01-05 9.305217 NaN NaN
... ... ... ...
2012-12-27 10.420833 10.885160 13.539309
2012-12-28 10.386667 10.455637 13.480087
2012-12-29 10.386667 10.382422 13.442454
2012-12-30 10.489167 10.440994 13.384371
2012-12-31 8.849167 10.350378 13.271621

731 rows × 3 columns

1.2: Train-Test Split

Selected features are

  • temp_7_day_mean
  • temp_30_day_mean

which are generated using rolling window techniques. Then, split the data into training and test set with the following condition:

  • Training set: Records from 1 January 2011 - 31 August 2012
  • Test set: Records from 1 September - 31 December 2012
In [46]:
#assigning the train, test, label and features.

features = ['temp_7_day_mean', 'temp_30_day_mean'] #features used
label = FEATURE # label

date_sep = '2012-09-01'

test_df = df[df.index >= date_sep] # index for test data
train_df = df[df.index < date_sep] # index for train data

train_df = train_df.dropna()

X_train, y_train = train_df[features], train_df[label] #assign train data
X_test, y_test = test_df[features], test_df[label] # assign test data

Step 2: Machine Learning Models

Define helper functions to calculate errors and plot actual and predicted values using Machine Learning models.

In [47]:
# Define helper functions
from sklearn import metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def calculate_errors(regressor, x_test, y_test):
    mae_calc = mean_absolute_error(y_test, regressor.predict(x_test))
    mre_calc = mean_relative_error(y_test, regressor.predict(x_test))
    mse_calc = mean_squared_error(y_test, regressor.predict(x_test))
    rmse_calc = rmse(y_test, regressor.predict(x_test))

    print('MAE: ', mae_calc)
    print('MRE: ', mre_calc)
    print('MSE: ', mse_calc)
    print('RMSE:', rmse_calc)

    
def plot_actual_and_predicted_values(df, regressor, x_test, y_test):
    predictions = regressor.predict(x_test)
    df['predictions'] = predictions
    
    ax = df[[FEATURE, 'predictions']].plot(figsize=(10, 4))

    ax.set_title("Average Daily Temperature: Actual vs Predicted")
    ax.set_xlabel("Date")
    ax.set_ylabel("Temperature")
    ax.legend(['Temperature: Actual', 'Temperature: Predicted'])
    print()

Model 1: XGBoost Regressor (learning rate = 0.01)

In [48]:
## Xgb with set value learning_rate = 0.01
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

# # Decision Tree Regressor
# reg1 = DecisionTreeRegressor(random_state=0)
# reg1.fit(X_train, y_train)

# # Bagging Regressor
# reg1 = BaggingRegressor(DecisionTreeRegressor(), random_state=0, n_estimators=10)
# reg1.fit(X_train, y_train)

# XGBRegressor
reg1 = XGBRegressor(n_estimators=100, learning_rate=0.01)
reg1.fit(X_train, y_train, 
         eval_set=[(X_train, y_train), (X_test, y_test)], 
         eval_metric='mae')

print("\nCalculated errors:")
calculate_errors(reg1, X_test, y_test)
plot_actual_and_predicted_values(test_df, reg1, X_test, y_test)
[0]	validation_0-mae:20.65548	validation_1-mae:17.53571
[1]	validation_0-mae:20.45071	validation_1-mae:17.35966
[2]	validation_0-mae:20.24797	validation_1-mae:17.18536
[3]	validation_0-mae:20.04724	validation_1-mae:17.01278
[4]	validation_0-mae:19.84850	validation_1-mae:16.84191
[5]	validation_0-mae:19.65173	validation_1-mae:16.67177
[6]	validation_0-mae:19.45690	validation_1-mae:16.50332
[7]	validation_0-mae:19.26402	validation_1-mae:16.33747
[8]	validation_0-mae:19.07303	validation_1-mae:16.17234
[9]	validation_0-mae:18.88395	validation_1-mae:16.00976
[10]	validation_0-mae:18.69677	validation_1-mae:15.84848
[11]	validation_0-mae:18.51141	validation_1-mae:15.68820
[12]	validation_0-mae:18.32793	validation_1-mae:15.52895
[13]	validation_0-mae:18.14623	validation_1-mae:15.37182
[14]	validation_0-mae:17.96637	validation_1-mae:15.21572
[15]	validation_0-mae:17.78825	validation_1-mae:15.06170
[16]	validation_0-mae:17.61194	validation_1-mae:14.90869
[17]	validation_0-mae:17.43738	validation_1-mae:14.75712
[18]	validation_0-mae:17.26451	validation_1-mae:14.60765
[19]	validation_0-mae:17.09339	validation_1-mae:14.45914
[20]	validation_0-mae:16.92404	validation_1-mae:14.31246
[21]	validation_0-mae:16.75636	validation_1-mae:14.16749
[22]	validation_0-mae:16.59035	validation_1-mae:14.02370
[23]	validation_0-mae:16.42604	validation_1-mae:13.88145
[24]	validation_0-mae:16.26327	validation_1-mae:13.74038
[25]	validation_0-mae:16.10220	validation_1-mae:13.60100
[26]	validation_0-mae:15.94264	validation_1-mae:13.46260
[27]	validation_0-mae:15.78475	validation_1-mae:13.32591
[28]	validation_0-mae:15.62834	validation_1-mae:13.19014
[29]	validation_0-mae:15.47378	validation_1-mae:13.05427
[30]	validation_0-mae:15.32053	validation_1-mae:12.92059
[31]	validation_0-mae:15.16872	validation_1-mae:12.78827
[32]	validation_0-mae:15.01869	validation_1-mae:12.65640
[33]	validation_0-mae:14.86992	validation_1-mae:12.52719
[34]	validation_0-mae:14.72269	validation_1-mae:12.39901
[35]	validation_0-mae:14.57681	validation_1-mae:12.27168
[36]	validation_0-mae:14.43263	validation_1-mae:12.14499
[37]	validation_0-mae:14.28999	validation_1-mae:12.02077
[38]	validation_0-mae:14.14869	validation_1-mae:11.89794
[39]	validation_0-mae:14.00861	validation_1-mae:11.77588
[40]	validation_0-mae:13.87011	validation_1-mae:11.65486
[41]	validation_0-mae:13.73305	validation_1-mae:11.53503
[42]	validation_0-mae:13.59735	validation_1-mae:11.41656
[43]	validation_0-mae:13.46323	validation_1-mae:11.29940
[44]	validation_0-mae:13.33038	validation_1-mae:11.18325
[45]	validation_0-mae:13.19864	validation_1-mae:11.06996
[46]	validation_0-mae:13.06841	validation_1-mae:10.95580
[47]	validation_0-mae:12.93928	validation_1-mae:10.84558
[48]	validation_0-mae:12.81174	validation_1-mae:10.73653
[49]	validation_0-mae:12.68515	validation_1-mae:10.62849
[50]	validation_0-mae:12.56014	validation_1-mae:10.52013
[51]	validation_0-mae:12.43630	validation_1-mae:10.41390
[52]	validation_0-mae:12.31348	validation_1-mae:10.30896
[53]	validation_0-mae:12.19210	validation_1-mae:10.20537
[54]	validation_0-mae:12.07211	validation_1-mae:10.10024
[55]	validation_0-mae:11.95303	validation_1-mae:9.99889
[56]	validation_0-mae:11.83508	validation_1-mae:9.89720
[57]	validation_0-mae:11.71844	validation_1-mae:9.79719
[58]	validation_0-mae:11.60280	validation_1-mae:9.69751
[59]	validation_0-mae:11.48875	validation_1-mae:9.59875
[60]	validation_0-mae:11.37556	validation_1-mae:9.50209
[61]	validation_0-mae:11.26331	validation_1-mae:9.40536
[62]	validation_0-mae:11.15230	validation_1-mae:9.31073
[63]	validation_0-mae:11.04243	validation_1-mae:9.21591
[64]	validation_0-mae:10.93371	validation_1-mae:9.12305
[65]	validation_0-mae:10.82611	validation_1-mae:9.03054
[66]	validation_0-mae:10.71950	validation_1-mae:8.93853
[67]	validation_0-mae:10.61424	validation_1-mae:8.84617
[68]	validation_0-mae:10.50985	validation_1-mae:8.75665
[69]	validation_0-mae:10.40636	validation_1-mae:8.66748
[70]	validation_0-mae:10.30396	validation_1-mae:8.57950
[71]	validation_0-mae:10.20280	validation_1-mae:8.49075
[72]	validation_0-mae:10.10225	validation_1-mae:8.39952
[73]	validation_0-mae:10.00273	validation_1-mae:8.31352
[74]	validation_0-mae:9.90416	validation_1-mae:8.22411
[75]	validation_0-mae:9.80682	validation_1-mae:8.14101
[76]	validation_0-mae:9.71038	validation_1-mae:8.05848
[77]	validation_0-mae:9.61479	validation_1-mae:7.97152
[78]	validation_0-mae:9.52045	validation_1-mae:7.89153
[79]	validation_0-mae:9.42702	validation_1-mae:7.81092
[80]	validation_0-mae:9.33450	validation_1-mae:7.72637
[81]	validation_0-mae:9.24296	validation_1-mae:7.64786
[82]	validation_0-mae:9.15228	validation_1-mae:7.56941
[83]	validation_0-mae:9.06250	validation_1-mae:7.48706
[84]	validation_0-mae:8.97407	validation_1-mae:7.41126
[85]	validation_0-mae:8.88623	validation_1-mae:7.33009
[86]	validation_0-mae:8.79956	validation_1-mae:7.26210
[87]	validation_0-mae:8.71355	validation_1-mae:7.18296
[88]	validation_0-mae:8.62827	validation_1-mae:7.10949
[89]	validation_0-mae:8.54396	validation_1-mae:7.03158
[90]	validation_0-mae:8.46079	validation_1-mae:6.96042
[91]	validation_0-mae:8.37800	validation_1-mae:6.89029
[92]	validation_0-mae:8.29619	validation_1-mae:6.81810
[93]	validation_0-mae:8.21510	validation_1-mae:6.74734
[94]	validation_0-mae:8.13481	validation_1-mae:6.67961
[95]	validation_0-mae:8.05539	validation_1-mae:6.61085
[96]	validation_0-mae:7.97718	validation_1-mae:6.54575
[97]	validation_0-mae:7.89923	validation_1-mae:6.47866
[98]	validation_0-mae:7.82237	validation_1-mae:6.41153
[99]	validation_0-mae:7.74608	validation_1-mae:6.34570

Calculated errors:
MAE:  6.345694663133623
MRE:  0.33893174617266875
MSE:  48.5666227159832
RMSE: 6.968975729329468

Model 2: XGBoost Regressor (learning rate = 0.1)

In [49]:
## Xgb with set value learning_rate = 0.1
reg2 = XGBRegressor(n_estimators=100, learning_rate=0.1)
reg2.fit(X_train, 
        y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='mae')

print("\nCalculated errors:")
calculate_errors(reg2, X_test, y_test)
plot_actual_and_predicted_values(test_df, reg2, X_test, y_test)
[0]	validation_0-mae:18.79411	validation_1-mae:15.93544
[1]	validation_0-mae:16.93130	validation_1-mae:14.31846
[2]	validation_0-mae:15.25384	validation_1-mae:12.86592
[3]	validation_0-mae:13.74513	validation_1-mae:11.53992
[4]	validation_0-mae:12.38733	validation_1-mae:10.35749
[5]	validation_0-mae:11.16814	validation_1-mae:9.30693
[6]	validation_0-mae:10.06904	validation_1-mae:8.33664
[7]	validation_0-mae:9.07960	validation_1-mae:7.49052
[8]	validation_0-mae:8.19209	validation_1-mae:6.73589
[9]	validation_0-mae:7.39863	validation_1-mae:6.07020
[10]	validation_0-mae:6.68234	validation_1-mae:5.42167
[11]	validation_0-mae:6.04270	validation_1-mae:4.90834
[12]	validation_0-mae:5.47192	validation_1-mae:4.43503
[13]	validation_0-mae:4.96112	validation_1-mae:4.01124
[14]	validation_0-mae:4.50729	validation_1-mae:3.69173
[15]	validation_0-mae:4.10372	validation_1-mae:3.41011
[16]	validation_0-mae:3.75220	validation_1-mae:3.13229
[17]	validation_0-mae:3.43800	validation_1-mae:2.87696
[18]	validation_0-mae:3.15625	validation_1-mae:2.73780
[19]	validation_0-mae:2.91574	validation_1-mae:2.58743
[20]	validation_0-mae:2.69999	validation_1-mae:2.46596
[21]	validation_0-mae:2.50919	validation_1-mae:2.35346
[22]	validation_0-mae:2.34934	validation_1-mae:2.25307
[23]	validation_0-mae:2.20669	validation_1-mae:2.19657
[24]	validation_0-mae:2.08496	validation_1-mae:2.18404
[25]	validation_0-mae:1.97758	validation_1-mae:2.17896
[26]	validation_0-mae:1.88548	validation_1-mae:2.16441
[27]	validation_0-mae:1.80113	validation_1-mae:2.16038
[28]	validation_0-mae:1.72704	validation_1-mae:2.15785
[29]	validation_0-mae:1.66734	validation_1-mae:2.13475
[30]	validation_0-mae:1.60465	validation_1-mae:2.14613
[31]	validation_0-mae:1.55527	validation_1-mae:2.14162
[32]	validation_0-mae:1.51685	validation_1-mae:2.12776
[33]	validation_0-mae:1.48043	validation_1-mae:2.12774
[34]	validation_0-mae:1.44971	validation_1-mae:2.13567
[35]	validation_0-mae:1.42566	validation_1-mae:2.12837
[36]	validation_0-mae:1.40667	validation_1-mae:2.12265
[37]	validation_0-mae:1.38701	validation_1-mae:2.12785
[38]	validation_0-mae:1.36496	validation_1-mae:2.14524
[39]	validation_0-mae:1.34548	validation_1-mae:2.14389
[40]	validation_0-mae:1.33154	validation_1-mae:2.14271
[41]	validation_0-mae:1.31684	validation_1-mae:2.15787
[42]	validation_0-mae:1.30428	validation_1-mae:2.15577
[43]	validation_0-mae:1.29353	validation_1-mae:2.15571
[44]	validation_0-mae:1.27623	validation_1-mae:2.16978
[45]	validation_0-mae:1.26810	validation_1-mae:2.16915
[46]	validation_0-mae:1.26078	validation_1-mae:2.16863
[47]	validation_0-mae:1.25408	validation_1-mae:2.16816
[48]	validation_0-mae:1.24667	validation_1-mae:2.16782
[49]	validation_0-mae:1.23360	validation_1-mae:2.17442
[50]	validation_0-mae:1.22795	validation_1-mae:2.17442
[51]	validation_0-mae:1.22053	validation_1-mae:2.17847
[52]	validation_0-mae:1.20819	validation_1-mae:2.19444
[53]	validation_0-mae:1.19840	validation_1-mae:2.19598
[54]	validation_0-mae:1.18648	validation_1-mae:2.20861
[55]	validation_0-mae:1.18194	validation_1-mae:2.20861
[56]	validation_0-mae:1.17253	validation_1-mae:2.20969
[57]	validation_0-mae:1.16832	validation_1-mae:2.20969
[58]	validation_0-mae:1.15812	validation_1-mae:2.22527
[59]	validation_0-mae:1.15216	validation_1-mae:2.22532
[60]	validation_0-mae:1.14820	validation_1-mae:2.22540
[61]	validation_0-mae:1.14046	validation_1-mae:2.22356
[62]	validation_0-mae:1.12730	validation_1-mae:2.22509
[63]	validation_0-mae:1.12409	validation_1-mae:2.22515
[64]	validation_0-mae:1.12111	validation_1-mae:2.22519
[65]	validation_0-mae:1.11196	validation_1-mae:2.23670
[66]	validation_0-mae:1.10152	validation_1-mae:2.25762
[67]	validation_0-mae:1.09431	validation_1-mae:2.25936
[68]	validation_0-mae:1.08864	validation_1-mae:2.26848
[69]	validation_0-mae:1.07963	validation_1-mae:2.28183
[70]	validation_0-mae:1.07272	validation_1-mae:2.28293
[71]	validation_0-mae:1.06974	validation_1-mae:2.28292
[72]	validation_0-mae:1.06272	validation_1-mae:2.28112
[73]	validation_0-mae:1.05717	validation_1-mae:2.28111
[74]	validation_0-mae:1.04932	validation_1-mae:2.28513
[75]	validation_0-mae:1.04449	validation_1-mae:2.29431
[76]	validation_0-mae:1.03279	validation_1-mae:2.30327
[77]	validation_0-mae:1.03034	validation_1-mae:2.30327
[78]	validation_0-mae:1.02228	validation_1-mae:2.31292
[79]	validation_0-mae:1.01567	validation_1-mae:2.31387
[80]	validation_0-mae:1.01160	validation_1-mae:2.31547
[81]	validation_0-mae:1.00396	validation_1-mae:2.31846
[82]	validation_0-mae:0.99643	validation_1-mae:2.31969
[83]	validation_0-mae:0.99479	validation_1-mae:2.31969
[84]	validation_0-mae:0.98499	validation_1-mae:2.31535
[85]	validation_0-mae:0.98073	validation_1-mae:2.31535
[86]	validation_0-mae:0.97579	validation_1-mae:2.32394
[87]	validation_0-mae:0.96444	validation_1-mae:2.33435
[88]	validation_0-mae:0.95916	validation_1-mae:2.33585
[89]	validation_0-mae:0.95677	validation_1-mae:2.33584
[90]	validation_0-mae:0.94673	validation_1-mae:2.32929
[91]	validation_0-mae:0.94486	validation_1-mae:2.32931
[92]	validation_0-mae:0.94124	validation_1-mae:2.33090
[93]	validation_0-mae:0.93745	validation_1-mae:2.33292
[94]	validation_0-mae:0.92626	validation_1-mae:2.33517
[95]	validation_0-mae:0.92453	validation_1-mae:2.33518
[96]	validation_0-mae:0.92140	validation_1-mae:2.33673
[97]	validation_0-mae:0.91826	validation_1-mae:2.34221
[98]	validation_0-mae:0.90961	validation_1-mae:2.34390
[99]	validation_0-mae:0.90572	validation_1-mae:2.34323

Calculated errors:
MAE:  2.343232329584934
MRE:  0.14194615770433255
MSE:  8.916774972622191
RMSE: 2.9860969462866054

Model 3: XGBoost Regressor (learning rate = 0.3)

In [50]:
## Xgb with set value learning_rate = 0.3

reg3 = XGBRegressor(n_estimators=100, learning_rate=0.3)
reg3.fit(X_train, 
        y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='mae')


print("\nCalculated errors:")
calculate_errors(reg3, X_test, y_test)
plot_actual_and_predicted_values(test_df, reg3, X_test, y_test)
[0]	validation_0-mae:14.65772	validation_1-mae:12.37928
[1]	validation_0-mae:10.31104	validation_1-mae:8.55989
[2]	validation_0-mae:7.28759	validation_1-mae:5.83976
[3]	validation_0-mae:5.21727	validation_1-mae:4.15735
[4]	validation_0-mae:3.82040	validation_1-mae:3.19428
[5]	validation_0-mae:2.88815	validation_1-mae:2.51705
[6]	validation_0-mae:2.31488	validation_1-mae:2.18740
[7]	validation_0-mae:1.94677	validation_1-mae:2.09665
[8]	validation_0-mae:1.71200	validation_1-mae:2.01519
[9]	validation_0-mae:1.56535	validation_1-mae:1.99617
[10]	validation_0-mae:1.46563	validation_1-mae:2.09687
[11]	validation_0-mae:1.39708	validation_1-mae:2.08971
[12]	validation_0-mae:1.35308	validation_1-mae:2.09887
[13]	validation_0-mae:1.31815	validation_1-mae:2.10748
[14]	validation_0-mae:1.28566	validation_1-mae:2.14787
[15]	validation_0-mae:1.27260	validation_1-mae:2.14096
[16]	validation_0-mae:1.25633	validation_1-mae:2.18102
[17]	validation_0-mae:1.24391	validation_1-mae:2.18240
[18]	validation_0-mae:1.20519	validation_1-mae:2.20395
[19]	validation_0-mae:1.18047	validation_1-mae:2.21961
[20]	validation_0-mae:1.17169	validation_1-mae:2.21975
[21]	validation_0-mae:1.16230	validation_1-mae:2.22119
[22]	validation_0-mae:1.14694	validation_1-mae:2.22150
[23]	validation_0-mae:1.11834	validation_1-mae:2.24788
[24]	validation_0-mae:1.10320	validation_1-mae:2.24739
[25]	validation_0-mae:1.09133	validation_1-mae:2.25516
[26]	validation_0-mae:1.08522	validation_1-mae:2.25525
[27]	validation_0-mae:1.05941	validation_1-mae:2.25972
[28]	validation_0-mae:1.02007	validation_1-mae:2.23691
[29]	validation_0-mae:0.99740	validation_1-mae:2.26526
[30]	validation_0-mae:0.97752	validation_1-mae:2.26513
[31]	validation_0-mae:0.96756	validation_1-mae:2.26511
[32]	validation_0-mae:0.95960	validation_1-mae:2.27158
[33]	validation_0-mae:0.92329	validation_1-mae:2.29315
[34]	validation_0-mae:0.90924	validation_1-mae:2.29584
[35]	validation_0-mae:0.90337	validation_1-mae:2.29578
[36]	validation_0-mae:0.88648	validation_1-mae:2.30190
[37]	validation_0-mae:0.85591	validation_1-mae:2.30426
[38]	validation_0-mae:0.83406	validation_1-mae:2.31770
[39]	validation_0-mae:0.81953	validation_1-mae:2.31733
[40]	validation_0-mae:0.81720	validation_1-mae:2.31733
[41]	validation_0-mae:0.79924	validation_1-mae:2.31925
[42]	validation_0-mae:0.77873	validation_1-mae:2.32884
[43]	validation_0-mae:0.75545	validation_1-mae:2.34082
[44]	validation_0-mae:0.74783	validation_1-mae:2.33939
[45]	validation_0-mae:0.73584	validation_1-mae:2.33623
[46]	validation_0-mae:0.71553	validation_1-mae:2.34263
[47]	validation_0-mae:0.70576	validation_1-mae:2.34830
[48]	validation_0-mae:0.69394	validation_1-mae:2.35912
[49]	validation_0-mae:0.68535	validation_1-mae:2.35563
[50]	validation_0-mae:0.66591	validation_1-mae:2.37662
[51]	validation_0-mae:0.65631	validation_1-mae:2.38316
[52]	validation_0-mae:0.64730	validation_1-mae:2.39734
[53]	validation_0-mae:0.64552	validation_1-mae:2.39732
[54]	validation_0-mae:0.64048	validation_1-mae:2.39731
[55]	validation_0-mae:0.62598	validation_1-mae:2.39632
[56]	validation_0-mae:0.61716	validation_1-mae:2.39361
[57]	validation_0-mae:0.61402	validation_1-mae:2.39359
[58]	validation_0-mae:0.60555	validation_1-mae:2.39269
[59]	validation_0-mae:0.59703	validation_1-mae:2.39271
[60]	validation_0-mae:0.58079	validation_1-mae:2.39479
[61]	validation_0-mae:0.56666	validation_1-mae:2.40764
[62]	validation_0-mae:0.55630	validation_1-mae:2.40830
[63]	validation_0-mae:0.54978	validation_1-mae:2.41204
[64]	validation_0-mae:0.54537	validation_1-mae:2.41610
[65]	validation_0-mae:0.54262	validation_1-mae:2.41761
[66]	validation_0-mae:0.53494	validation_1-mae:2.41894
[67]	validation_0-mae:0.51932	validation_1-mae:2.42166
[68]	validation_0-mae:0.50932	validation_1-mae:2.41978
[69]	validation_0-mae:0.50708	validation_1-mae:2.42259
[70]	validation_0-mae:0.50154	validation_1-mae:2.42190
[71]	validation_0-mae:0.49922	validation_1-mae:2.42190
[72]	validation_0-mae:0.48591	validation_1-mae:2.43823
[73]	validation_0-mae:0.48117	validation_1-mae:2.43823
[74]	validation_0-mae:0.46966	validation_1-mae:2.46682
[75]	validation_0-mae:0.46547	validation_1-mae:2.46681
[76]	validation_0-mae:0.45655	validation_1-mae:2.46558
[77]	validation_0-mae:0.45289	validation_1-mae:2.46744
[78]	validation_0-mae:0.44716	validation_1-mae:2.46679
[79]	validation_0-mae:0.44497	validation_1-mae:2.46679
[80]	validation_0-mae:0.43518	validation_1-mae:2.47501
[81]	validation_0-mae:0.42791	validation_1-mae:2.47271
[82]	validation_0-mae:0.42505	validation_1-mae:2.47295
[83]	validation_0-mae:0.41484	validation_1-mae:2.47441
[84]	validation_0-mae:0.40704	validation_1-mae:2.48067
[85]	validation_0-mae:0.40253	validation_1-mae:2.48123
[86]	validation_0-mae:0.39905	validation_1-mae:2.48106
[87]	validation_0-mae:0.39132	validation_1-mae:2.47565
[88]	validation_0-mae:0.38554	validation_1-mae:2.47726
[89]	validation_0-mae:0.38438	validation_1-mae:2.47724
[90]	validation_0-mae:0.38162	validation_1-mae:2.47725
[91]	validation_0-mae:0.37764	validation_1-mae:2.47724
[92]	validation_0-mae:0.37185	validation_1-mae:2.48441
[93]	validation_0-mae:0.36983	validation_1-mae:2.48534
[94]	validation_0-mae:0.36344	validation_1-mae:2.50198
[95]	validation_0-mae:0.35669	validation_1-mae:2.49928
[96]	validation_0-mae:0.35373	validation_1-mae:2.49896
[97]	validation_0-mae:0.35014	validation_1-mae:2.49792
[98]	validation_0-mae:0.34516	validation_1-mae:2.50103
[99]	validation_0-mae:0.33920	validation_1-mae:2.50490

Calculated errors:
MAE:  2.5048957415644004
MRE:  0.14894126022374252
MSE:  9.59174015531357
RMSE: 3.097053463425126

Model 4: Decision Tree Regressor

In [51]:
# Decision Tree Regressor
reg4 = DecisionTreeRegressor(random_state=0)
reg4.fit(X_train, y_train)

print("\nCalculated errors:")
calculate_errors(reg4, X_test, y_test)
plot_actual_and_predicted_values(test_df, reg4, X_test, y_test)
Calculated errors:
MAE:  2.9011248083111947
MRE:  0.16788128767498953
MSE:  13.001131132432896
RMSE: 3.6057081318976576

Model 5: Bagging Regressor

In [52]:
# Bagging Regressor
reg5 = BaggingRegressor(DecisionTreeRegressor(), random_state=0, n_estimators=10)
reg5.fit(X_train, y_train)

print("\nCalculated errors:")
calculate_errors(reg5, X_test, y_test)
plot_actual_and_predicted_values(test_df, reg5, X_test, y_test)
Calculated errors:
MAE:  2.086231543877837
MRE:  0.12516884716291504
MSE:  6.655657701832612
RMSE: 2.579856139755202


Step 3: The Best Model

The best model is Model 5, which is the bagging model with decision tree as the base estimator.

The prediction of this model yields the least errors between the actual and the predicted daily temperatures.

Calculated errors:
MAE:  2.086231543877837
MRE:  0.12516884716291504
MSE:  6.655657701832612
RMSE: 2.579856139755202
In [53]:
predictions = reg5.predict(X_test)
In [54]:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 731 entries, 2011-01-01 to 2012-12-31
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   temp              731 non-null    float64
 1   temp_7_day_mean   725 non-null    float64
 2   temp_30_day_mean  702 non-null    float64
dtypes: float64(3)
memory usage: 22.8 KB
In [55]:
test_df['predictions'] = predictions
In [56]:
test_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 122 entries, 2012-09-01 to 2012-12-31
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   temp              122 non-null    float64
 1   temp_7_day_mean   122 non-null    float64
 2   temp_30_day_mean  122 non-null    float64
 3   predictions       122 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB
In [57]:
plot_actual_and_predicted_values(test_df, reg5, X_test, y_test)

In [58]:
test_df
Out[58]:
temp temp_7_day_mean temp_30_day_mean predictions
datetime
2012-09-01 30.886667 29.251548 29.177194 30.832000
2012-09-02 28.563333 29.505357 29.082667 30.483500
2012-09-03 29.007500 29.529762 28.965361 30.227250
2012-09-04 29.759167 29.515119 28.906139 30.227250
2012-09-05 30.203333 29.817738 28.884500 29.451667
... ... ... ... ...
2012-12-27 10.420833 10.885160 13.539309 8.767167
2012-12-28 10.386667 10.455637 13.480087 9.556417
2012-12-29 10.386667 10.382422 13.442454 8.258083
2012-12-30 10.489167 10.440994 13.384371 9.556417
2012-12-31 8.849167 10.350378 13.271621 8.230750

122 rows × 4 columns


Step 4: Forecast of Time Series

Define helper function to transform a time series data into the format suitable for supervised learning tasks.

In [59]:
# ================================================================================
# Define helper function to transform a time series data into a supervised
# ... learning data.
# ================================================================================
def series_to_supervised(data, n_in=2, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[0]
    df = DataFrame(data)
    cols = list()
    
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    
    # put it all together
    agg = concat(cols, axis=1)
    
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    
    return agg.values

Forecast the daily temperature from 1 - 10 Jan 2013 (10 days) using the best Machine Learning model.

In [60]:
# ================================================================================
# Since we have build the model, in this stage we are going to use the whole set
# ... of data again, but only select the temperature column. 
# ... This is because we only want to do forecasting on the temperature column.
# ================================================================================
values = df[FEATURE].values # select column temperature.
preds = [] 

# ================================================================================
# Define the number of forecasting. 10 indicates 10 new value will be forecasted. 
# ... You can change it accordingly.
# ================================================================================
num_days_pred = 10

for i in range(num_days_pred): 
    # transform the time series data into supervised learning. 
    # n_in=6 meaning you are using the first 6 value to forecast the value ahead
    forecast = series_to_supervised(values, n_in=6)
    
    # split into input and output columns
    forecastI, forecastO = forecast[:, :-1], forecast[:, -1]
    
    # fit model. Took back the value the we built before.
#     model = XGBRegressor(n_estimators=100, learning_rate=0.1)
    model = BaggingRegressor(DecisionTreeRegressor(), random_state=0, n_estimators=10)
    model.fit(forecastI, forecastO)
    
    # construct an input for a new preduction
    row = values[-6:].flatten()
    
    # make a prediction
    yhat = model.predict(asarray([row]))
    print('Input: %s, Predicted: %.3f' % (row, yhat[0]))
    
    values = np.append(values, yhat)
    preds.append(yhat)
Input: [ 9.97666667 10.42083333 10.38666667 10.38666667 10.48916667  8.84916667], Predicted: 8.390
Input: [10.42083333 10.38666667 10.38666667 10.48916667  8.84916667  8.39014493], Predicted: 9.074
Input: [10.38666667 10.38666667 10.48916667  8.84916667  8.39014493  9.07377536], Predicted: 8.735
Input: [10.38666667 10.48916667  8.84916667  8.39014493  9.07377536  8.73466377], Predicted: 8.616
Input: [10.48916667  8.84916667  8.39014493  9.07377536  8.73466377  8.61577565], Predicted: 8.951
Input: [8.84916667 8.39014493 9.07377536 8.73466377 8.61577565 8.95071357], Predicted: 8.678
Input: [8.39014493 9.07377536 8.73466377 8.61577565 8.95071357 8.67822614], Predicted: 8.337
Input: [9.07377536 8.73466377 8.61577565 8.95071357 8.67822614 8.33706352], Predicted: 7.980
Input: [8.73466377 8.61577565 8.95071357 8.67822614 8.33706352 7.97982191], Predicted: 8.458
Input: [8.61577565 8.95071357 8.67822614 8.33706352 7.97982191 8.45846162], Predicted: 7.930
In [61]:
import datetime 
df_preds = pd.DataFrame(preds, index=df.index[-len(preds):] + datetime.timedelta(days=num_days_pred))
In [62]:
ax = df[df.index >= date_sep][FEATURE].plot(figsize=(10, 4), legend=True, label='Temperature: Actual')
ax = test_df['predictions'].plot(ax=ax, label='Temperature: Predicted')
ax = df_preds.plot(ax=ax, legend=True, label='Temperature: Forecast')

ax.set_title("Average Daily Temperature: Actual vs Predicted vs Forecast")
ax.set_xlabel("Date")
ax.set_ylabel("Temperature")
ax.legend(['Temperature: Actual', 'Temperature: Predicted', 'Temperature: Forecast'])

print()

In [ ]: